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When fitting hierarchical regression models, maximum likelihood (ML) esti- 
mation has computational (and, for some users, philosophical) advantages 
compared to full Bayesian inference, but when the number of groups is small, 
estimates of the covariance matrix () of group-level varying coefficients are 
often degenerate. One can do better, even from a purely point estimation per- 
spective, by using a prior distribution or penalty function. In this article, we use 
Bayes modal estimation to obtain positive definite covariance matrix estimates. 
We recommend a class of Wishart (not inverse-Wishart) priors for X with a 
default choice of hyperparameters, that is, the degrees of freedom are set equal 
to the number of varying coefficients plus 2, and the scale matrix is the identity 
matrix multiplied by a value that is large relative to the scale of the problem. 
This prior is equivalent to independent gamma priors for the eigenvalues of X 
with shape parameter 1.5 and rate parameter close to 0. It is also equivalent to 
independent gamma priors for the variances with the same hyperparameters 
multiplied by a function of the correlation coefficients. With this default prior, 
the posterior mode for X is always strictly positive definite. Furthermore, the 
resulting uncertainty for the fixed coefficients is less underestimated than under 
classical ML or restricted maximum likelihood estimation. We also suggest an 
extension of our method that can be used when stronger prior information is 
available for some of the variances or correlations. 
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Hierarchical or mixed-effects regression models are increasingly popular in 
applied statistics and can be viewed as Bayesian at the following two levels: A 
prior distribution is assigned to the varying coefficients, and the parameters of 
that prior distribution themselves are given a hyperprior. The family of models 
can be written in general terms as follows: Data are in groups j = 1,... J. For 
each group j, there is a response vector y; and two data matrices, X; and Z,, that 
have fixed and varying coefficients, respectively. The data model is 
p(y;|XjB + Zjb;), where B is the vector of fixed coefficients and b; is the vector 
of regression coefficients that varies by group. The vectors b; are modeled as 
independent draws from a prior distribution, p(b;), given some hyperparameters. 
We shall assume a normal model for the varying coefficients, so that b; ~ N(0, 2). 
The model could also include a nonzero mean vector or a group-level regression 
structure for the hyperprior distribution, but these can be folded into the fixed 
coefficients in the data model without loss of generality. 

There is a rich literature on full Bayesian inference for hierarchical regres- 
sions. There is also an empirical Bayes version in which the hyperparameters 
(in this case, X) are estimated via maximum likelihood (ML) and then inference 
for the coefficients is performed conditional on the estimated &. From the 
Bayesian perspective, the empirical Bayes approach is suboptimal, both 
because it avoids the use of any prior information on = and because it under- 
states posterior uncertainty. From a pragmatic perspective, however, we recog- 
nize that the point estimation approach has two advantages that give it great 
appeal to many users. First, existing software such as lme4 in R and various com- 
mands in Stata allow such models to be fit fast and reliably for moderate-sized 
data sets, whereas software for Markov chain Monte Carlo simulation for full 
Bayes inference is not yet so immediately practical. Second, the non-Bayesian 
motivation behind point estimation is attractive to practitioners who want the 
benefits of partial pooling and hierarchical modeling without needing to spe- 
cify prior information or fully buy into the Bayesian paradigm. 

The subject of this article is the use of Bayesian ideas and methods to pro- 
duce better inferences for hierarchical models via better point estimates of the 
hyperparameters. In that sense, this work falls into a long tradition of Bayesian 
tools used for practical non-Bayesian inferences (e.g., Agresti & Coull, 1998). 
Bayes modal (BM) estimation (or penalized likelihood) has also been used to 
obtain more stable estimates in item response theory (e.g., Mislevy, 1986; 
Swaminathan & Gifford, 1985; Tsutakawa & Lin, 1986) and to avoid boun- 
dary estimates (or logit parameters tending to +00) in log-linear models 
(Galindo-Garre, Vermunt, & Bergsma, 2004), logistic regression (Gelman, 
Jakulin, Pittau, & Su, 2008), varying-intercept models with constant coefficients 
(Chung, Rabe-Hesketh, Dorie, Gelman, & Liu, 2013), random-effects 
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meta-analysis models (Chung, Rabe-Hesketh, & Choi, 2013), and latent class anal- 
ysis (Galindo-Garre & Vermunt, 2006; Maris, 1999). Such an approach has also 
been used to obtain nondegenerate covariance matrices in factor analysis (Martin 
& McDonald, 1975), in finite mixtures of normal densities (Ciuperca, Ridolfi, & 
Idier, 2003; Vermunt & Magidson, 2005), and in multivariate regression (Warton, 
2008). In varying intercept models, the Stein loss function (Srivastava & Kubokawa, 
1999) and an extension of MANOVA estimation (Amemiya, 1985) have been used 
for obtaining nonnegative definite covariance estimators. 

The key problem solved by our method is the tendency of ML estimates of X 
to be degenerate, that is, on the border of positive definiteness, which corre- 
sponds to zero variance or perfect correlation among some linear combinations 
of the parameters. When the ML estimate of a hierarchical covariance matrix 
is degenerate, this often arises from a likelihood that is nearly flat in the relevant 
dimension and just happens to have a maximum at the boundary. 

Our solution is a class of weakly informative prior densities for X that go to zero 
on the boundary as X becomes degenerate, thus ensuring that the posterior mode 
(i.e., the maximum penalized likelihood estimate) is always nondegenerate. We 
recommend a class of Wishart priors with a default choice of hyperparameters, that 
is, the degrees of freedom is the dimension of b; plus 2 and the scale matrix is the 
identity matrix multiplied by a large enough number. This prior can be expressed 
as a product of gamma(1.5, 0) priors on the eigenvalues of © or as a product of 
gamma(1.5, 8) priors on variances of the varying effects with rate parameter 0 
— 0 and a function of the correlations (a beta prior in the two-dimensional case). 
In the varying-intercept model (Chung, Rabe-Hesketh, Dorie, et al., 2013) and 
random-effects meta-analysis model (Chung, Rabe-Hesketh, & Choi, 2013), the 
gamma(1.5, 9) prior successfully avoids boundary estimates while producing 
estimates that are consistent with the data. We show that this is also true for the 
default Wishart prior proposed in this article for general varying coefficient models. 

In a simulation study and an education example presented later, the default 
Wishart prior always gives nondegenerate estimates of & (in particular, nonperfect 
correlation coefficients) without decreasing the log likelihood substantially. The 
BM estimators of the standard deviations and correlations using the default Wishart 
prior have better statistical properties than the (restricted) ML estimators. 

When prior information is available for specific standard deviations or cor- 
relations, additional penalty functions may be included. Specifically, if the 
prior most plausible value for a standard deviation or correlation parameter 
is o* or p*, respectively, then we propose multiplying the Wishart prior by the 
gamma(2, 2/o*) or M(p*, .25”) densities. This assigns more prior probability 
around the preferred values while exploiting the property of the Wishart prior 
that it ensures that the estimates remain positive definite. 

The outline of the article is as follows. First, we illustrate the boundary esti- 
mation problems encountered in ML estimation of hierarchical variance and cov- 
ariance parameters. Then, we introduce the default Wishart prior for & and 
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investigate its properties. Next, additional penalty functions are proposed that 
incorporate further prior knowledge for some of the parameters. Finally, our 
method is applied to an example from education research and simulated data. 


Boundary Estimation Problem 


Consider the varying-coefficients model, 


Vij xB z;;b) bey i=1,...,4,f=1,...,J, (1) 


where y;; is the response variable for unit i in group j, Xx, is a p-dimensional cov- 
ariate vector with constant (or fixed) coefficients B, z,; is a d-dimensional covari- 
ate vector with varying coefficients b; ~ M(0, 2), and ej ~ N(0, 02) is a residual 
for each observation. We further assume that b; and ¢; are independent of each 
other and of the covariates (and suppress conditioning on covariates througout 
the paper). 


Non-Bayesian Point Estimation 


For each j, y; = (vy, aide Ynj) ~ N (Xp, Vj), where X; is an; x p matrix with 
j in the ith row, V; = ZjZZ? + o2/, and Z; is an, x d matrix with z;; in the ith 
row. The log-likelihood function is given by: 


x 


logp(yiB.2,02) = —5 » tog IM + S>(y— 9B) ¥"(y, - x8) (2) 


where the constant term, —(N/2)log(27), has been dropped. The ML estimator is 
obtained by maximizing the log-likelihood function. 

It is known that the ML estimator of the covariance matrix is biased for finite 
samples (Lehmann & Casella, 1998), and an often-preferred option is restricted 
maximum likelihood (REML; Patterson & Thompson, 1971), as it takes into 
account the degrees of freedom for the fixed coefficients B. Harville (1974) 
showed that the REML estimator can be derived by specifying flat prior distribu- 
tions for B, marginalizing over B, and maximizing the marginal (or restricted) 
likelihood with respect to © and o2. The restricted log-likelihood function is 
given by: 


J J 
1 7 
logpe(ylZ,o2) =— 5 ho XPV 1X] + So log |%}| 
j=1 j=1 
yen” aes : (3) 
+ (yy -%B) 4" (-¥8)} 
j=l 
up to a constant, where 
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FIGURE 1. Proportion of data sets, out of 1,000, where the maximum likelihood (ML) 
estimate of the covariance matrix is singular. (a) Two-dimensional case: When o = 
.25, 87% of the ML estimates are singular for J = 5. As o and J increase, the proportion 
decreases but is greater than 40% for the conditions considered. (b) Two to five dimen- 
sions 6 = 1: As the dimension of X increases, there is a rapid increase in the probability 
of the estimate being degenerate. 


J “ly 4 
Q Ty-1 Ty-l 
p= (Soary x) (Say s) 
iF i 


Singular Estimates of X using ML and REML 


ML and REML often yield singular (i.e., nonpositive definite) estimates of X. 
This boundary includes the cases where some varying coefficients have zero var- 
lance or a varying coefficient is a linear combination of the other varying 
coefficients. 

We present two simulation studies to demonstrate how often singular esti- 
mates of X occur in the varying-coefficients model. In the first study, we consider 
a model with two-dimensional varying coefficients, that is, a varying intercept bo; 
and a varying slope 5,;. We set the group size to n = 10 and the number of groups 
to J= 5 or 10. A covariate that varies within group only was generated from (0, 
1) and group-mean centered. The varying coefficients (bo;, b1;) were generated 
from M0, 07/5) with o = 0.25, 0.5, 0.75, 1. Setting the correlation to 0 corre- 
sponds to the best-case scenario in the sense of being furthest from the boundary. 
The within-group variance o? was set to | and the fixed coefficients Bo and B, 
were set to 0. For each of 1,000 random samples of data from the model, we 
obtained ML and REML estimates using Imer (Bates & Maechler, 2010) in R. 

Figure la shows the proportion of ML estimates of X on the boundary for the 
two-dimensional case. For J = 5 groups, 87% of the ML estimates are singular 
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when o = 0.25 and the proportion decreases as o increases but remains as high as 
72% when o = 1. For J = 10 groups, the proportions are smaller than those for 
J = 5 but still, in more than 40% of the simulations, the likelihood is maximized 
at a singular %. The REML estimator yields smaller proportions of singular esti- 
mates with a similar trend (not shown). For J = 10, 79% and 64% of the REML 
estimates are singular when o = 0.25 and o = 1, respectively. For J = 10, the 
proportion is reduced to 69% and 35% when o = 0.25 and o = I, respectively. 

Our second simulation study considers various dimensions, from d = 2 to 
d = 5, each time with a varying intercept and d — 1 varying slopes for n = 
10 and J = 5 or 10. The d — 1 covariates were independently drawn from 
N(O, 1) and centered at their group means as in the previous simulation. The 
varying coefficients b; were drawn from M(0, J,) and o2 was set to 1. 
Figure |b presents the proportion of replicates where the ML estimate > is sin- 
gular. As the number of dimensions increases, this proportion increases rapidly, 
exceeding 95% with five varying coefficients for both J = 5 and J = 10. For 
REML, the proportions of singular estimates are slightly lower than for ML but 
follow a similar pattern and exceed 35% across all simulation conditions. 

In some contexts, singular estimates of the covariance matrix are acceptable 
or considered as an indication of structural misspecification of the model. In the 
varying-intercept model, a negative group-level variance estimate is sometimes 
permitted if the model is viewed as a marginal model for the responses, given the 
covariates where only the sum of the group-level and within-group variance must 
be positive (Verbeke & Molenberghs, 2000, pp. 52-53). In factor analysis and 
structural equation models, a negative variance estimate, called a Heywood case, 
is sometimes interpreted as model misspecification, especially if the null 
hypothesis that the variance is nonnegative can be rejected (Kolenikov & Bollen, 
2012). However, this article takes a hierarchical perspective of the multilevel 
linear model, where the intercepts and slopes vary due to omitted group-level 
variables. Therefore, the variances of the varying coefficients must be 
nonnegative, and perfect correlations among linear combinations of varying 
coefficients are regarded as unrealistic. 


Weakly Informative Wishart Prior for © 


We propose posterior modal estimation with a prior on &, implicitly assum- 
ing uniform priors for the other parameters. With a prior p(X), the log-posterior 
function can be written as follows: 


log p(B, 2, oly) = logp(y|B, 2, 0.) + logp(Z) +, (4) 


and we find the mode of log p(B, 2, o.|y). This approach can also be viewed as 
maximum penalized likelihood estimation where log p(X) is a penalty function. 
We consider a family of Wishart (not inverse-Wishart) densities for the prior on 
2. The Wishart density function on X with hyperparameters v and is defined by: 
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|D|%-4-Y? exp [- Lir(P-"2)] 
24/2) —p |"? D4 (v/2) 


p(2) = y>d-1,¥>0, (5) 
where T'y(v/2) = 14(¢-D/4 Te I'(v/2+ (1 —j)/2), v is the degrees of free- 
dom, and is a scale matrix with E(Z) = v'P. 

If we set Y to be a diagonal matrix (1/20)/,, the Wishart density of & in 
Equation 5 can be written as: 


g@/? (v—d—1)/2 
x)= x exp(—Otr(= 
P2) =p p(—6r()) 
dv/2 _d 
= 8 sy /? exp(—0,) 
Ta (v/2) r=1 
y—d+l1 (6) 
dr =a =e 0 ’ 
* Is ( ia ) 
where ;,..., Aq are the eigenvalues of & and g(x|a, 0) is the gamma(a, 0) den- 


sity with shape parameter « and rate parameter 0, g(x|x, 0) = tox! exp(—x6). 
In the previous equations, note that we do not transform the density of X to the 
density of eigenvalues, but just rewrite Equation 5 as a function of eigenvalues 
without including a Jacobian term. 

As a default choice, we propose v = d + 2 and 0 — 0. In practice, we can 
choose a sufficiently small number for 0, for example, 0 = 10-* or 10°. If 
these two values of 0 lead to almost the same parameter estimates, we can con- 
sider the choice of 0 to be sufficiently close to the limit 0. In order to avoid 


dependency on the scale of the response variable, we can also use an improper 


prior pe * which is the same as the Wishart prior up to constant in the 
limit 6 — 0 (Chung, Rabe-Hesketh, Dorie, et al., 2013). This prior is propor- 
tional to independent gamma(1.5, 8) densities of the eigenvalues as observed 
in Equation 6. If X is a diagonal matrix, this prior implies gamma(1.5, 8) priors 
on the diagonal elements of X, which is equivalent to gamma(2, 8) priors on the 
standard deviations when 0 — 0. If X is not diagonal, we obtain gamma(1.5, 8) 
priors on the variances and a function of the correlations. 

The advantage of this family of density functions is that they equal zero at the 
boundary—thus, the BM or penalized likelihood estimate for & will never be 
degenerate—but the densities move away from zero when & moves off the 
boundary, so that the posterior mode can be arbitrarily close to degeneracy if this 
is what the data demand. In contrast, various other families of models do not have 
these properties, making them less desirable when used for the purpose of BM 
point estimation. The inverse-Wishart family of density, one of the most com- 
monly used priors for & in the full Bayesian inference, is also zero at the 
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boundary. However, it tends to assign an excessive penalty near the boundary 
because it is a function of X~! and |X|~' while the Wishart density is a function 
of & and |=]. 

Alternative choices of v and 0 can be considered but v and 0 larger than the 
default choice will make the prior more informative. This behavior might be 
preferable if a plausible value of & is available. In the next section, we suggest 
including additional prior information about any specific standard deviation by 
multiplying the default prior by an additional penalty function, which can be 
viewed as a special case of the Wishart prior with larger v and 0. 

Priors on the covariance matrix in the varying-coefficients model have been 
investigated by several authors in the context of full Bayesian modeling. Daniels 
and Kass (1999) investigated nonconjugate Bayesian estimation of covariance 
matrices in hierarchical models including an inverse-Wishart prior on covariance 
matrices with unknown scale and degrees of freedom and a normal prior on 
Fisher’s z-transformed correlations. Barnard, McCulloch and Meng (2000) 
decomposed & = Diag(s)RDiag(s) where s is a vector of standard deviations 
and R is the correlation matrix, which is assigned marginal or jointly uniform 
priors. O’Malley and Zaslavsky (2005) propose a scaled inverse Wishart, a 
decomposition similar to that of Barnard, McCulloch, and Meng (2000) except 
that the central matrix R itself has an inverse-Wishart distribution rather than 
being constrained to be a correlation matrix. Our approach is different from 
these others in being explicitly intended not for full Bayes inference but as a 
tool to obtain positive definite posterior modal estimates. As such, our concerns 
are different from those involved in constructing traditional Bayesian priors. 

Unlike posterior mean estimation, BM estimation does not involve simulation 
and is computationally as efficient as ML estimation. By modifying existing ML 
estimation procedures, gllamm (Rabe-Hesketh, Skrondal, & Pickles, 2005) in 
Stata and Imer (Bates & Maechler, 2010) in R, we have developed software to 
find the maximum of the penalized likelihood. The modified gllamm is available 
from www.gllamm.org and blmer, the modified Imer function, can be found in 
the blme package available from the Comprehensive R Archive Network. 


Varying-Intercept Models: d = 1 
The varying-intercept model is a special case of the model in Equation | with 
d = 1, given by: 
Vig = XB + bj + 8%, 


where b; ~ N(0,0%) and ¢,; ~ N(0,02). The Wishart prior in Equation 6 is 
equivalent to a gamma(v/2,0) prior on o%. With the default choice of hyperpara- 
meters, vy =3(=d-+2) and 0 — 0, the Wishart prior coincides with a 


gamma(1.5, 0) prior on 0. 


143 


Downloaded from http://jebs.aera.net at COLUMBIA UNIV on April 30, 2015 


Weakly Informative Prior 


When 0 — 0, the gamma(1.5, 0) prior on o% has a density function proportional to 
0», which is also proportional to the gamma(2, 8) prior on o,. The gamma(2, 8) prior 
on 6, is recommended as a weakly informative prior for avoiding estimates of o, 
equal to zero in the varying-intercept model (Chung, Rabe-Hesketh, Dorie, et al., 
2013) and in random-effects meta-analysis models (Chung, Rabe-Hesketh, & Choi, 
2013). Since the gamma(2, 0) prior is 0 at o,, = 0, the posterior density is also 0 at o, 
= 0 and thus the posterior mode of 0, is always strictly positive. In addition, since 
the gamma density has a positive constant derivative at o, = 0, the gamma(2, @) den- 
sity increases linearly at zero. It follows that the profile likelihood of o, (maximized 
over all the other parameters) dominates the posterior density of 0, if the likelihood 
is strongly curved near 0; = 0. That is, the prior does not rule out positive values near 
zero if they are supported by the likelihood. Chung, Rabe-Hesketh, Dorie, et al. 
(2013) show that the posterior mode is approximately one standard error away from 
zero when the ML estimate of o,, is zero. Finally, the estimator behaves reasonably 
well in terms of mean squared error of parameter estimates and coverage of 
confidence intervals for fixed parameters. 

In the context of small area estimation, strictly positive group-level variance 
estimators have been proposed for the Fay and Herriot model (1979), a varying- 
intercept model for aggregated group-level data and known heterogeneous 
within-group variances. Adjustment for density maximization (Li & Lahiri, 2010; 
Morris, 2006; Morris & Tang, 2011) applies a penalty term 1(o7) = (o to 
the likelihood, and this approach turns out to be equivalent to posterior modal esti- 
mation with a gamma(, 8) prior on o, with « = 2c + 1 and @ — 1. Therefore, for 
this specific varying-intercept model, our estimator shares the properties of adjust- 
ment for density maximization, such as predictions of the group means being mini- 
max for mean squared-error loss when the within-group variances are equal and 
c <1 (Morris & Tang, 2011). 


Varying-Intercept and Varying-Slope Models: d = 2 


When d = 2, the model includes a varying intercept and a varying slope of one 
covariate, written as: 


T 
Vig = XB + boy + byzy + ey, 


where (bo;, b1;) ~ N(0, 2) and ¢ ~ N(0, 02). 

As shown in Equation 6, with the default choice v = d + 2, the Wishart den- 
sity can be written as a product of gamma(1.5, 0) densities on the eigenvalues , 
and 3. For the bivariate case, we can also express the default prior as a function 
of the variances (of and 03) and the correlation (~) between the two varying 
effects bo; and b,;, given by: 


P(E) x |Z I! = o1o2V/1 = p?. (7) 
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FIGURE 2. Conditional density of p;; with Wishart (A + 2, (1/20)1) on X, 0 = 10“, where 
the other parameters are randomly generated from the Wishart distribution for 20 repli- 
cates. When d = 2, the conditional density is beta(1.5, 1.5), but for larger d, the curves are 
more scattered and the supports of the densities become narrower. 


This expression implies that Wishart(4, (1/20)/,) with 8 — 0 is equivalent 
to the joint density of independent gamma(1.5, 9) priors on both ot and o3, 
and a beta(1.5, 1.5) prior on (p + 1)/2. 

Since the beta(1.5, 1.5) prior on (p + 1)/2 is zero at the boundaries p = +1, 
the posterior mode of & cannot be attained at any matrices with perfect correla- 
tion. In addition, the beta(1.5, 1.5) density function increases rapidly as p 
approaches 0 from +1 and so does not rule out values close to + 1. The left panel 
of Figure 2 shows the beta(1.5, 1.5) density on (p + 1)/2. Whereas gamma(2, 0) 
increases linearly at 0, the slopes of beta(1.5, 1.5) at +1 are +oo. Therefore, 
compared to the gamma(2, 9) prior for 0; and 6», the beta(1.5, 1.5) for p is less 
informative with lower penalties on the values around the boundaries. 

The beta priors have been used to avoid boundary estimates of the probability 
parameter p of the binomial distribution. When the sample proportion is 0 or 1, the 
traditional Wald confidence interval for p degenerates to the point estimate. To 
avoid such boundary estimates, Agresti and Coull (1998) specified a beta(2, 2) 
prior on p. The posterior mean of p then is the sample proportion after adding two 
successes and two failures to the data. Compared with the beta(2, 2), the beta(1.5, 
1.5) tends to assign less penalty at the boundaries and so is less informative. 


Higher Dimensional Case: d > 3 


Similar to the case d= 2, the default prior for d > 3 can be written as a product 
of o,,r = 1,..., d and a function of p,.,, the correlation between the 7th and sth 
varying effects (0<r<s,s =2,...,d). For example with d = 3, the Wishart(5, 
(1/20)/5) prior with 6 — 0 can be written as: 


p(d) x |z|'/? x 0102034/1 Pir — P33 — PIs + 2P12P23P13- (8) 
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d=2 d=5 d=10 


Density 
Density 
Density 


FIGURE 3. Marginal density of p, with Wishart (d + 2,(1/20)I), 8 = 10-*. When d = 2, 
the marginal density of p is equivalent to beta(1.5, 1.5) on (p + 1)/2 (solid curve). As d 
increases, the marginal density has more mass around 0 due to the positive semidefinite 
constraint of the covariance matrix. 


This is a product of gamma(1.5, 8) priors on the variances and a function 
of the correlations. This function depends on the squares of the correlations, 
as in the two-dimensional case (Equation 7), but also contains the product of 
three correlations, which comes from the constraint |x| > 0 that defines the 
support of Wishart distributions. Because of this constraint, the Wishart 
prior automatically restricts the posterior mode of & to be strictly positive 
definite. 

The graphs in Figure 2 show the conditional densities of 912 when X follows 
the Wishart(d + 2,(1/20)/,), 8 = 10~*. The curves are the density of p; con- 
ditional on the other parameter values (standard deviations and the other cor- 
relations) that are randomly generated from Wishart(d + 2,(1/20)/) with 20 
replicates. When d = 2, the correlation follows beta(1.5, 1.5) as discussed pre- 
viously. When d = 3, the curves have distinct supports, defined by 
1 — (p12)” — (p93) — (p83)? + 2pirpspls > 0 where pi; and p$; for each 
replicate are given by randomly generated &. The curves for d = 5 are more 
scattered and the supports of the densities tend to be narrower than for d = 
2 and 3 due to more restrictions required for the higher dimensional & to be 
positive definite. 

The marginal prior densities of p,, are displayed in Figure 3 for d = 2, 5, 
and 10. With 10,000 replicates, d-dimensional matrices were randomly gener- 
ated from the Wishart(d + 2,(1/20)/) with 0 = 10~* and 10,000(d — 1)(d — 2)/ 
2 correlation coefficients were used to construct the histograms. For d = 2 
(left), the distribution of the correlation coefficient matches the beta(1.5, 
1.5) density, shown as a solid curve. As d increases, the marginal prior density 
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FIGURE 4. Conditional prior density of p12 with additional N(—.5, .25°) (left and middle) 
and N(.5, .257) (right) densities multiplying the default Wishart prior. The Wishart prior is 
on three-dimensional & and ~ 713 and (23 are fixed as 0 (left) and .5 (middle and right). The 
additional normal penalty makes the prior density skewed toward the prior value, but still 
enforces positive definiteness. 


of p,, becomes more concentrated around zero because of the positive definite- 
ness of X. 


Incorporating Additional Prior Information 


In the previous section, we suggested the Wishart(d + 2,(1/20)/) with 8 — 
0 as a default prior when no other information is available. If a researcher has 
additional prior knowledge about any specific standard deviations or correla- 
tions, he or she might want to adjust the prior to incorporate such information. 
In this section, we suggest multiplying the Wishart prior by functions of the 
parameters on which we have information. Because the Wishart density 
ensures that X is positive definite, we can choose the functions for the other 
parameters to be intuitive and easy to specify without regard for the parameter 
space. 

If o* isa plausible value for o,, then the gamma(2,2/o*) density is recommended 
as a penalty. Recall that the default Wishart prior is proportional to gamma(2,0) 
priors with 8 — 0 on each standard deviation, multiplied by a function of the 
correlations. When the gamma(2,2/o*) density of o is multiplied by the Wishart, 
the part including 6, becomes 06? exp(—20,/0*). This is proportional to the 
gamma(3,2/o*) density that has its mode at o,.= o*. The gamma prior with shape 
parameter greater than two assigns more penalty near zero than for shape parameter 
equal to two. Therefore, we have a more informative prior with mode at o*. 

If any specific correlation p,, is believed to be close to p*, we can incorpo- 
rate this prior information by multiplying the default Wishart prior by a 
N(p*,t”) density. As usual, the scale parameter t can be chosen depending 
on the prior uncertainty regarding p,,. A possible default choice is t = .25 
because it is the standard deviation of the beta(1.5,1.5) distribution. Figure 4 
displays the shape of conditional prior densities of p,2 with additional normal 
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priors in the three-dimensional case. When /;3 and po3 are fixed at zero (left), 
the default Wishart(5,(1/20)/;) prior (solid curve) is pretty flat. In order to 
incorporate the prior information, for example, p* = —.5, the Wishart is multi- 
plied by the M(—.5, .257) density, and then the prior mode moves toward —.5 
(dashed curve). When (13 and p23 are .5 (middle and right), the support of the 
Wishart for p12 is on [—0.5, 1] because of the constraint of positive definite- 


ness. When our prior value is on the boundary p* = —.5 (middle), the Wishart 
multiplied by M(—.5, .25”) density is skewed toward —.5, but still enforces pos- 
itive definiteness. When the prior value is inside the support, p* = —.5, the 


resulting density is less skewed (right). 

The default prior for in the two-dimensional case is beta(1.5, 1.5), and so it 
would seem natural to use the beta family for p,, when constructing an additional 
penalty. However, the parameters of the normal distribution are more intuitive 
because they represent the prior mean (and mode) and variance. In addition, since 
the positive definiteness of = is already guaranteed by the Wishart prior, estimates 
of 2 remain positive definite regardless of the type of additional penalties that multi- 
ply the Wishart prior. Furthermore, computation is no problem in any case; includ- 
ing any closed-form prior density adds essentially no cost to the optimization. 


Example: A Varying Intercept, Varying Slope Model in Education Research 


We illustrate our approach using a study of Heller et al. (2007) on the effects of 
the Mathematics Pathways and Pitfalls (MPP) teacher professional development 
program on mathematics learning for students at different levels of English lan- 
guage proficiency. Half of the 36 teachers were randomized to MPP and the other 
half to the control condition. Teachers randomized to the MPP condition were 
taught how to use the materials and then substituted MPP for part of their mathe- 
matics curriculum during the 2003—2004 school year, while control teachers used 
their regular mathematics curriculum. All students received an MPP test as a pret- 
est before the lessons and took the same test after the lessons as a posttest. 

Posttest scores are regressed on the mean-centered pretest scores, an indicator 
for treatment group (1 for MPP and 0 for control), English language learner 
(ELL) status (1 for ELL and 0 for non-ELL), and the Treatment x ELL Interac- 
tion Term. A varying intercept and a varying slope for ELL status are included to 
allow for the cluster-randomized design. The model can be written as follows: 


T 
Vij = X{B + bo + byzy + Ey, 


where y;; is the posttest score for the ith student of the jth teacher, x;; is the cov- 
arlate vector that includes the mean-centered pretest score, the treatment group 
indicator, ELL status, and the interaction between ELL status and treatment, and 
zy is ELL status. As usual, we assume (bo;,1;) ~ N(0,2) and ej ~ N(0, 02). 
After dropping observations with missing values on any of the variables, data 
were available on 755 students and J = 36 teachers, with between 12 and 27 
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TABLE 1. 
Parameter estimates for education example 
ML REML BM 
Fixed effect 
Intercept 32.39 (2.01) 32.40 (2.07) 32.31 (2.11) 
Pretest 0.56 (0.06) 0.56 (0.06) 0.56 (0.06) 
Treatment 12.84 (3.15) 12.81 (3.24) 13.01 (3.30) 
ELL —2.46 (2.73) —2.54 (2.77) —2.66 (3.17) 
ELL x Treatment 1.00 (4.13) 1.24 (4.19) 1.56 (4.84) 
Varying effect (group: teacher) 
Intercept SD 8.31 (1.18) 8.62 (1.25) 8.50 (1.22) 
ELL SD 0.71 (2.09) 0.48 (2.18) 3.64 (2.50) 
Correlation —1.00 (2.93) —1.00 (0.00) —0.32 (0.22) 
Residual SD 226.5 227.3 226.3 
Log likelihood —3,153.7 —3,153.8 —3,154.2 


Note. ELL = English language learner; SD = standard deviation; ML = maximum likelihood; 
REML = restricted maximum likelihood; BM = Bayes Modal. The ML and REML estimates imply 
perfect correlation between the varying intercept and varying slope, whereas BM produces more 
reasonable estimates. The log likelihood stays almost the same among the three methods. We present 
results here to more decimal places than would be recommended in practice in order to display the 
sometimes-small differences between the different estimates. 


students per teacher. We fit the models by ML and REML using Imer in the Ime4 
package and by BM using blmer in the blme package. 

Table 1 presents ML, REML, and BM estimates with the default Wishart(4,(1/ 
20)[,) prior with 9 = 10~*. Both ML and REML estimates of the correlation 
between bo; and b,; are —1. This implies an unrealistic perfect correlation 
between the teacher-level slopes and intercepts. The BM estimate of p is —.32 
and the standard deviation estimate of the varying slope for ELL status increases 
from 0.71 for ML and 0.48 for REML to 3.64, a change that is within the uncer- 
tainty implied by the asymptotic standard error of 2.1 (ML) or 2.2 (REML) for 
that parameter. The standard deviation of the varying intercept stays similar for 
ML, REML, and BM. 

The fixed coefficient estimates are similar across estimation methods. The 
coefficient for the interaction term between ELL and treatment changes the most 
among all the fixed coefficients, but the differences are negligible considering 
that the standard errors of the interaction term are greater than 4. The standard 
errors of the fixed coefficient estimates of Treatment, ELL, and Treatment by 
ELL are larger for BM than for ML or REML, suggesting that ML and REML 
underestimate the uncertainty. 

The log likelihood at the BM estimates differs from the maximum by less 
than |. Figure 5 shows the profile likelihood of p (profiling out all the other 
parameters) divided by its maximum. Although the ML is attained at p = —1, the 


149 


Downloaded from http://jebs.aera.net at COLUMBIA UNIV on April 30, 2015 


Weakly Informative Prior 


Profile likelihood / maximum likelihood 


FIGURE 5. Profile likelihood of p. The maximum likelihood estimate of p is —1 but the 
likelihood has very little information. Therefore, the Bayes modal estimate of —0.3 is also 
well supported by the data. 


profile likelihood is very flat and so the minimum (at p = 1) is attained with only 
an 8% decrement from the maximum. Therefore, all the values of p including 
p = —0.32 are well supported by the data. As is typical in such settings, there is 
nothing special about the point estimate on the boundary, and it would be inap- 
propriate for a researcher to use that estimate. Our BM approach gives a default 
procedure that allows a classical statistician to avoid the inappropriate degenerate 
estimate. A full Bayes approach using real prior information would do better, but 
our BM approach takes us a bit in the right direction and has the advantage of being 
fast and easy to implement. 

When a researcher is interested in comparing teacher-specific effects, bo; and 
b,; can be predicted by their conditional posterior means (or modes), given the 
estimates of the model parameters and the data (called empirical Bayes predic- 
tion or best linear unbiased prediction). 

In Figure 6, scatter plots of empirical Bayes predictions of b,; versus bo; are 
displayed with the proportion of ELL students of each teacher represented by the 
gray scale, that is, black indicates all the students are ELL and white indicates 
none are ELL. The sizes of the squares are proportional to the numbers of stu- 
dents for each teacher. For ML (left), due to the estimate p = —1, the slopes 
b,; are predicted perfectly linearly by the intercepts bo;. In contrast, BM (right) 
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FIGURE 6. Empirical Bayes predictions of varying effects. The size of each square rep- 
resents n, for the jth teacher. The ratio of English language learner (ELL) students for 
each teacher is shown on a gray scale, that is, black indicates all the students are ELL 
and white indicates none are ELL. For maximum likelihood, bj, are predicted perfectly 
linearly in bo;. On the right graph, Bayes modal estimation shows more reasonable pre- 
dictions for the varying slopes and intercepts. 


shows more reasonable predictions for the varying slopes and intercepts. In addi- 
tion, we can observe that 18 (out of 36) white squares with a gray border fall per- 
fectly on a line—these are teachers without any ELL students in their classes. 
Four black squares (only three visible due to overlap) correspond to teachers with 
only ELL students. The 18 groups without ELL students and the 4 groups with 
only ELL students do not provide any information about the slope variance 
and intercept variance, respectively, and none of the 22 groups provide 
information about the correlation between the varying slope and the 
intercept. This lack of information could be one of the reasons we obtain the 
boundary estimates using ML and REML. As the group size increases (i.e., 
the square increases) and the proportion of ELL students increases (i.e., the 
square gets darker), the empirical Bayes predictions tend to be less shrunken 
toward the line formed by the white squares. 

Using the fitted covariance matrix, we can calculate the marginal variances 
and correlations of the posttest score given ELL status. The variance of the postt- 
est scores for ELL students is Var(yj|z, = 1) = of + 05 + 2012 + 02 and, simi- 
larly, the variance for non-ELL student is Var(y,|z; = 0) =o} +02. The 
covariance between the posttest scores of two students of the same teacher is 
Cov (yi, ¥iejlzy = 1,zej = 1) = Of +.03+ 2012 if both students are ELL, 
Cov (yi, Yr; 24 = 1,2"; = 0) — at +06). if one student is ELL, and 
Cov (yi, ¥iejlzi = 0, Z-; = 0) = Of if neither student is ELL. 
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TABLE 2. 
Marginal standard deviations and correlations of posttest scores given ELL status 
ML BM 
SD of ELL student 16.86 17.06 
SD of non-ELL student 17.19 17.26 
Correlation of (ELL, ELL) 0.20 0.22 
Correlation of (ELL, non-ELL) 0.22 0.21 
Correlation of (non-ELL, non-ELL) 0.23 0.24 


Note. ELL = English language learner; BM = Bayes modal; SD = standard deviation; ML = maximum 
likelihood. These values do not differ much between ML and BM although the slope standard deviation 
estimate and correlation estimate increased notably from ML to BM. 


Table 2 shows these model-implied marginal standard deviations and 
correlations with estimates from ML and BM substituted for the parameters. 
These standard deviation and correlation estimates are remarkably similar, 
which also explains why the log likelihood evaluated at the BM estimates 
is not much smaller than that evaluated at the ML estimates. 


Simulation 


We simulated data from the varying coefficient model as described in the pre- 
liminary simulation for Figure 1 but with only one covariate. We explored differ- 
ent values of the correlation p(0, 0.225, 0.450, 0.675, and 0.900), setting o to be a 
moderate value of 0.5. With 1,000 replicated samples generated with J = 5 and 
n = 30, we estimated the bias and root mean squared error (RMSE) for 0), 09, 
and p. For ML and REML, the bias and RMSE of p are based on the replicates 
that generate legitimate estimates (i.e., when neither G; nor G2 is zero which hap- 
pened in 1.2% of the replicates for ML and 0.9% of the replicates for REML). For 
BM estimation, we assigned a Wishart(4,(1/20)/) prior on © with 0 = 107“. 

Figure 7 shows the proportion of boundary estimates of p (where 1— 
|0| < 10~°). When p is 0, 21% of the ML estimates and 17% of the REML 
estimates have perfect correlations. As p increases, the proportion of p on the bound- 
ary also increases and reaches 60% for ML and 51% for REML. The BM method does 
not produce any boundary estimates of p for any of the simulation conditions. 

In spite of the absence of boundary estimates, the log likelihood is not reduced 
substantially by using BM estimation. Investigating the difference in deviances 
(= 2[log L(=ur) - log L(Zpm)}) for all the replicates, the BM method never 
reduces the log likelihood by more than 2.2 from the maximum. 

Figure 8 summarizes the estimated bias and RMSE of (, 6), and G2. When 
p = 0, the estimated bias of # is almost zero for all three methods. 

ML, REML, and BM all have some bias in estimating p, with BM having the 
most bias (i.e., the most shrinkage toward 0), as would be expected given the 
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FIGURE 7. Proportion of maximum likelihood (ML) and restricted maximum likelihood 
(REML) estimates of p that are on the boundary. When p = 0, 21% of the ML estimates 
and 17% of the REML estimates are +1. As p increases, the proportion of estimates on 
the boundary, ~ equal to + 1, also increases and reaches 60% for ML and 51% for REML 
when p = .9. 


regularization from the Wishart prior that squeezes p toward zero as seen in 
the shape of the prior density for p with d = 2 in Figure 3. However, BM gives 
the smallest estimated RMSE of p. The estimated bias of G6; and G2 is similar 
across the different values of p for all the estimation methods. The BM esti- 
mates of the standard deviations are less biased and have smaller estimated 
RMSE than ML and REML. 

The coverage of 95% confidence intervals for By and B, does not change much 
with p. The average coverage of the BM confidence intervals is .940 for By and 
.943 for B,. The coverage for REML is about the same as that for BM, whereas 
ML shows slightly lower coverage with averages of .935 for By and .937 for By. 


Conclusion 


For the hierarchical regression model, particularly with several varying coef- 
ficients, degenerate covariance matrix estimates do not have a practical interpre- 
tation. Unfortunately, such boundary estimates commonly arise in ML estimation 
because there is often little information on these parameters when there is only a 
moderate number of groups. In addition, when & is singular, underestimated stan- 
dard errors of the fixed coefficients make the researcher overconfident about the 


153 


Downloaded from http://jebs.aera.net at COLUMBIA UNIV on April 30, 2015 


Weakly Informative Prior 


p O1 02 
a _+ BM a | 
o aw cs 
| - “¢——_8—_* ———_,) 
- ¥ . ML 
s : 8 
o So 
g 
oa , & 
ace) 34 Be ee ba REML 
Sy Conn neers aes settee 
1 passeese Bessecsen Beserest Bares AREML ° ry 
ee ee spon tee. 
sisi donee dbo ent BM +BM 
8 8 
S T T T ra ell T T T T4 S T T T T1 
i) 0.225 0.45, 0.675 09 1 0 0.225 0.45, 0.675 09 1 0 0.225 0.45 0.675 09 1 


P Pp 


1 


RMSE 


00 01 02 03 04 05 06 0.7 


seeps btn: BM 


L 


F 
0.225 «4045 0675 09 1 0 0.225 045 0675 09 1 
Pp p 


° 
° 
v4 
8 
& 
2 
B 

ed 
° 
a 
2 
a 
° 
© 
° 


FIGURE 8. Bias and root mean squared error of 6, G2, and p with J = 5 and n = 30 of 
the varying-coefficient model with 0; = 02 = .5 and p in the grid. In our simulation, with 
p set to various positive values, the bias values are all negative, so we display absolute 
values to make the graphs easier to read given the convention that high values of bias are 
bad. Bayes modal (BM) has higher bias for p (i.e., shrinking the estimate toward 0) com- 
pared to maximum likelihood (ML) and restricted maximum likelihood (REML), but the 
RMSE is smaller for BM. For both o; and o>, BM has smaller bias and RMSE than 
ML and REML. 


effect of the covariates. When a boundary estimate is attained but no prior infor- 
mation is available for X, the BM estimator using the default Wishart prior is rec- 
ommended because it ensures strictly positive definite > and is weakly 
informative at the same time. The modified gllamm from www.gllamm.org for 
Stata and blme package for R allow straightforward application of our method 
for practitioners. 

In varying-slope models, changing the location and scale of the covariates 
that have varying slopes implies that & must change to produce an equivalent 
model. For example, for longitudinal data, we might want to transform the 
time variable to have a value 0 at the initial time point. In this case, subtract- 
ing a constant from the covariate changes the variance of the varying inter- 
cepts and the correlation between intercepts and slopes. Although ML and 
REML will yield equivalent models after linearly transforming the covariate, 
this is no longer true for BM estimation, which pulls the correlation toward 0. 
When using Bayesian regularization in this setting, it therefore becomes more 
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important to choose meaningful centering points for the covariates with vary- 
ing coefficients. 
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